NAR Genomics and Bioinformatics — Latest Matching Preprints

1

Effects of technical noise on bulk RNA-seq differential gene expression inference

Sheerin, D.; O'Connor, D.; Pollard, A. J.; Mohorianu, I.

2019-11-16 bioinformatics 10.1101/843789 medRxiv

Top 0.1%

54.3%

Motivation: Inconsistent, analytical noise introduced either by the sequencing technology or by the choice of read-processing tools can bias bulk RNA-seq analyses by shifting the focus to the variation in expression of low-abundance transcripts; as a consequence, these highly variable genes are often included in the differential expression (DE) call and affect the interpretation of results. Results: To illustrate the effects of "noise", we present simulated datasets that closely follow the characteristics of an H. sapiens and an M. musculus dataset, respectively, highlighting the extent of technical noise in both a high inter-individual variability (H. sapiens) and a reduced variability (M. musculus) setup. The sequencing-induced noise is assessed using correlations of distributions of expression across transcripts; analytical noise is evaluated through side-by-side comparisons of several standard choices. The proportion of genes in the noise range differs for each tool combination. Data-driven, sample-specific noise thresholds were applied to reduce the impact of low-level variation. Noise adjustment reduced the number of significantly DE genes and gave rise to convergent calls across tool combinations. Availability: The code for determining the sequence-derived noise is available for download from https://github.com/yry/noiseAnalysis/tree/master/noiseDetection_mRNA; the code for running the analysis is available for download from https://github.com/sheerind/noise_detection.

2

Predicting Relative Protein Abundance via Sequence-Based Information

Parkes, G.; Ewing, R. M.; Niranjan, M.

2021-11-08 bioinformatics 10.1101/2021.11.08.467260 medRxiv

Top 0.1%

52.5%

Understanding the complex interactions between transcriptome and proteome is essential in uncovering cellular mechanisms both in health and disease contexts. The limited correlations between corresponding transcript and protein abundance suggest that regulatory processes tightly govern information flow surrounding transcription and translation, and beyond. In this study we adopt an approach which expands the feature scope that models the human proteome: we develop machine learning models that incorporate sequence-derived features (SDFs), sometimes in conjunction with corresponding mRNA levels. We develop a large resource of sequence-derived features which cover a significant proportion of the H. sapiens proteome, demonstrate which of these features are significant in prediction on multiple cell lines, and suggest insights into which biological processes can be explained using these features. We (a) reveal that SDFs are significantly better at protein abundance prediction across multiple cell lines in both steady-state and dynamic contexts, (b) show that SDFs can cover the domain of translation with relative efficiency but struggle with cell-line-specific pathways, and (c) provide a resource which can be plugged into many subsequent protein-centric analyses.

3

Reference-free Analysis of scRNA-seq Data Reveals Elevated rRNA and mtRNA Transcription during Neurogenesis in Axolotl

Ratul, M. R. Z.; Karim, M. R.; Samee, M. A. H.; Rahman, A.

2024-11-18 bioinformatics 10.1101/2024.11.18.624113 medRxiv

Top 0.1%

50.7%

Analysis of single-cell RNA-seq data is typically performed on a gene expression matrix estimated by aligning reads to a reference transcriptome. However, this approach is difficult to apply to organisms with no or incomplete reference transcriptomes. In addition, events deviating from the reference remain undetected. Here we present a reference-free method to analyze single-cell RNA-seq data based on k-mers. We assess the performance of our method on a metastatic renal cell carcinoma dataset and find that it is largely able to capture differentially expressed genes. We then analyze a recently generated dataset to study neurogenesis in Axolotl and observe increased levels of transcription of rRNA and mtRNA during neurogenesis as well as a miRNA with previously predicted links to neuronal development. We also detect lncRNAs and intron retention in heart disease-related genes in diseased cardiomyocytes in an analysis of a congenital heart disease dataset.
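
As a rough illustration of the k-mer idea described in this abstract, and not the authors' implementation, the sketch below counts canonical k-mers per cell directly from reads, with no reference transcriptome involved. The function names, the choice of k = 5, and the toy reads are all assumptions made for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def count_kmers(reads, k):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts

# Hypothetical toy input: reads grouped by cell barcode.
cells = {
    "cell_A": ["ACGTACGTAC", "TTGCAACGTA"],
    "cell_B": ["ACGTACGTAC", "GGGGGCCCCC"],
}

# One k-mer count profile per cell; a downstream comparison between groups of
# cells would operate on these profiles instead of a gene expression matrix.
kmer_matrix = {cell: count_kmers(reads, k=5) for cell, reads in cells.items()}
```

Using canonical k-mers makes the counts independent of which strand a read came from, which is a common convention in k-mer tools; whether the method in the preprint does the same is not stated in the abstract.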

4

Improved sensitivity and resolution of ATAC-seq differential DNA accessibility analysis

Sheikh, A. A.; Blais, A.

2022-03-16 genomics 10.1101/2022.03.16.484118 medRxiv

Top 0.1%

45.3%

Eukaryotic genomes are packaged into chromatin, and the extent of its compaction must be modulated to allow several biological processes such as gene transcription. The regulatory elements of expressed genes are typically in relatively accessible chromatin, and several studies have revealed a reliable correlation between the abundance of mRNA transcripts and the degree of DNA accessibility at the regulatory elements of their coding genes. Consequently, the genome-wide profiling of DNA accessibility by methods such as ATAC-seq can help in the study of gene regulatory networks by serving as a proxy for gene expression and by helping identify important gene cis-regulatory elements and the trans-acting factors that bind them. The predominant approach used to identify differentially accessible genomic loci from ATAC-seq data obtained in two conditions of interest is comparable to that employed in RNA-seq gene expression profiling studies: accessible regions are identified through peak calling and treated like "genes", then sequenced DNA fragments (originating from two neighboring transposase insertion events) that overlap them are counted and subjected to abundance modeling, which in turn identifies the regions that differ significantly between the two conditions. We reasoned that this approach could be improved in terms of sensitivity and resolution by introducing two changes: bypassing peak calling in favor of a genome-wide sliding window quantification approach, and counting transposase insertion sites instead of fragments originating from two neighboring insertion sites. We present the development of this approach, which we term "widaR", for Window- and Insertion-based Differential Accessibility in R, using a murine skeletal myoblast differentiation dataset. Reproducible R code is provided.
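
The two changes described in this abstract (sliding windows instead of peaks, insertion sites instead of fragments) can be sketched in a few lines. This is an illustrative Python sketch and not the widaR R code: the function names, the window and step sizes, and the coordinate convention are assumptions, and any strand-specific offset correction of insertion positions is omitted.

```python
from collections import Counter

def insertion_sites(fragments):
    """Yield both ends of each fragment as (chrom, position) insertion sites.

    Fragments are (chrom, start, end) with 0-based, half-open coordinates,
    so the two insertion events are taken to be at start and end - 1.
    """
    for chrom, start, end in fragments:
        yield chrom, start
        yield chrom, end - 1

def window_counts(fragments, window=100, step=50):
    """Count insertion sites in overlapping genome-wide windows.

    Window w covers [w * step, w * step + window); keys are (chrom, window start).
    """
    counts = Counter()
    for chrom, pos in insertion_sites(fragments):
        last = pos // step                          # last window starting at or before pos
        first = max(0, (pos - window) // step + 1)  # first window still covering pos
        for w in range(first, last + 1):
            counts[(chrom, w * step)] += 1
    return counts

# Hypothetical fragment: insertion events at positions 30 and 120 on chr1.
example = window_counts([("chr1", 30, 121)])
```

A per-window count table of this kind, built for each sample, is the sort of input that count-based differential abundance models expect; the modeling step itself is outside this sketch.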

5

Unveiling the Terra Cognita of Sequence Spaces using Cartesian Projection of Asymmetric Distances

Ramette, A.

2025-09-09 bioinformatics 10.1101/2025.09.04.674223 medRxiv

Top 0.1%

39.5%

Visualizing relationships within massive biological datasets remains a significant challenge, particularly as sequence length and volume increase. We introduce CAPASYDIS (Cartesian Projections of Asymmetric Distances), a scalable approach designed to map the explored regions of a given sequence space. Unlike traditional dimensionality reduction methods, CAPASYDIS calculates asymmetric distances which account for both the position and type of sequence variations. It projects sequences into a fixed, low-dimensional coordinate system, termed a "seqverse", where each sequence occupies a permanent location. This design allows for the instant mapping of new sequences without the need to recalculate the global structure, transforming sequence analysis from a relative comparison into navigation on a standardized map. We applied this method to a large rRNA sequence dataset spanning the three domains of life. Our results demonstrate that the sequences of Bacteria, Archaea, and Eukaryota occupy spatially distinct regions characterized by fundamentally different shapes and patterns of variation. Furthermore, the resulting seqverses retain a high amount of taxonomic information when analyzed from broad domain levels down to single-base differences. Overall, CAPASYDIS provides a reproducible, scalable framework for defining the boundaries and topography of biological sequence universes.

6

Seqpac: A New Framework for small RNA analysis in R using Sequence-Based Counts

Skog, S.; Orkenby, L.; Tariq, K.; Ostlund Farrants, A.-K.; Ost, A.; Natt, D.

2021-03-21 bioinformatics 10.1101/2021.03.19.436151 medRxiv

Top 0.1%

38.7%

Small RNA sequencing (sRNA-seq) has become important for studying regulatory mechanisms in many cellular processes. Data analysis remains challenging, mainly because each class of sRNA--such as miRNA, piRNA, tRNA- and rRNA-derived fragments (tRFs/rRFs)--needs special considerations. Analysis therefore involves complex workflows across multiple programming languages, which can produce research bottlenecks and transparency issues. To make analysis of sRNA more accessible and transparent we present seqpac: a tool for advanced group-based analysis of sRNA completely integrated in R. This opens advanced sRNA analysis for Windows users--from adaptor trimming to visualization. Seqpac provides a framework of functions for analyzing a PAC object, which contains 3 standardized tables: sample phenotypic information (P), sequence annotations (A), and a counts table with unique sequences across the experiment (C). By applying a sequence-based counting strategy that maintains the integrity of the fastq sequence, seqpac increases flexibility and transparency compared to other workflows. It also contains an innovative targeting system allowing sequence counts to be summarized and visualized across sample groups and sequence classifications. Reanalyzing published data, we show that seqpac's fastq trimming performs on par with standard software outside R and demonstrate how sequence-based counting detects previously unreported bias. Applying seqpac to new experimental data, we discovered a novel rRF that was down-regulated by RNA pol I inhibition (anticancer treatment) and up-regulated in previously published data from tumor-positive patients. Seqpac is available on GitHub (https://github.com/Danis102/seqpac), runs on multiple platforms (Windows/Linux/Mac), and is provided with a step-by-step vignette on how to analyze sRNA-seq data.

7

Utilising Nanopore direct RNA sequencing of blood from patients with sepsis for discovery of co- and post-transcriptional disease biomarkers

He, J.; Ganesamoorthy, D.; Chang, J. J.-Y.; Zhang, J.; Trevor, S. L.; Gibbons, K. S.; McPherson, S. J.; Kling, J. C.; Schlapbach, L. J.; Blumenthal, A.; The RAPIDS Study Group; Coin, L. J. M.

2024-12-14 genetic and genomic medicine 10.1101/2024.12.13.24318230 medRxiv

Top 0.1%

34.6%

Background: RNA sequencing of whole blood has been increasingly employed to find transcriptomic signatures of disease states. These studies traditionally utilize short-read sequencing of cDNA, missing important aspects of RNA expression such as differential isoform abundance and poly(A) tail length variation. Methods: We used Oxford Nanopore Technologies long-read sequencing to sequence native mRNA extracted from whole blood from 12 patients with suspected bacterial and viral sepsis, and compared with results from matching Illumina short-read cDNA sequencing data. Additionally, we explored poly(A) tail length variation, novel transcript identification and differential transcript usage. Results: The correlation of gene count data between Illumina cDNA and Nanopore RNA-sequencing strongly depended on the choice of analysis pipeline; NanoCount for Nanopore and Kallisto for Illumina data yielded the highest mean Pearson's correlation of 0.93 at gene level and 0.74 at transcript isoform level. We identified 18 genes significantly differentially polyadenylated and 4 genes with significant differential transcript usage between bacterial and viral infection. Gene ontology gene set enrichment analysis of poly(A) tail length revealed enrichment of long tails in signal transduction and short tails in oxidoreductase molecular functions. Additionally, we detected 594 non-artifactual novel transcript isoforms, including 9 novel isoforms for Immunoglobulin lambda like polypeptide 5 (IGLL5). Conclusions: Nanopore RNA- and Illumina cDNA-gene counts are strongly correlated, indicating that both platforms are suitable for discovery and validation of gene count biomarkers. Nanopore direct RNA-seq provides additional advantages by uncovering post- and co-transcriptional biomarkers, such as poly(A) tail length variation and transcript isoform usage.

8

Regulus, a transcriptional regulatory networks inference tool based on Semantic Web technologies

Louarn, M.; Collet, G.; Barre, E.; Fest, T.; Dameron, O.; Siegel, A.; Chatonnet, F.

2021-08-03 bioinformatics 10.1101/2021.08.02.454721 medRxiv

Top 0.1%

34.3%

Motivation: Transcriptional regulation is performed by transcription factors (TF) binding to DNA in context-dependent regulatory regions and determines the activation or inhibition of gene expression. Current methods of transcriptional regulatory network inference, based on one or all of TF, region and gene activity measurements, require a large number of samples for ranking the candidate TF-gene regulation relations and rarely predict whether they are activations or inhibitions. We hypothesize that transcriptional regulatory networks can be inferred from fewer samples by (1) fully integrating information on TF binding, gene expression and regulatory region accessibility, (2) reducing data complexity and (3) using biology-based logical constraints to determine the global consistency of the candidate TF-gene relations and qualify them as activations or inhibitions. Results: We introduce Regulus, a method which computes TF-gene relations from gene expression, regulatory region activity and TF binding site data, together with the genomic locations of all entities. After aggregating gene expressions and region activities into patterns, data are integrated into an RDF endpoint. A dedicated SPARQL query retrieves all potential relations between expressed TF and genes involving active regulatory regions. These TF-region-gene relations are then filtered using a logical consistency check translated from biological knowledge, which also allows them to be qualified as activation or inhibition. Regulus compares favorably to the closest network inference method, provides signed relations consistent with public databases and, when applied to biological data, identifies both known and potential new regulators. Altogether, Regulus is devoted to transcriptional network inference in settings where samples are scarce and cell populations are closely related. Regulus is available at https://gitlab.com/teamDyliss/regulus

9

Bio informatics: Integrate negative controls to get the good data

van Nues, R. W.

2024-10-09 bioinformatics 10.1101/2024.10.08.617225 medRxiv

Top 0.1%

34.2%

High-throughput datasets, like any experimental output, can be full of noise. Negative controls, i.e. mock experiments not providing information concerning the biological system under study, visualise background. Overlooking this training set of wrong examples in publicly available datasets can seriously undermine validity of bioinformatics analyses. We present a program, COALISPR, for explicit and transparent application of negative control data in the comparison of high-throughput sequencing results. This yields mapping coordinates that guide fast counting of reads, bypassing the need for a reference file, and is especially relevant when small RNA sequencing libraries contaminated with breakdown products are analysed for poorly annotated organisms. We have re-analysed small RNA datasets for mouse and fungus Cryptococcus neoformans, leading to consistent identification of miRNAs and of fungal transcripts targeted by siRNAs. Cryptococcal Argonautes are directed to spliced transcripts indicating that RNAi must be triggered by events downstream of intron removal. Negative control datasets contain large amounts of ribosomal RNA (rRNA) fragments (rRFs). These differ from small RNAs associated with RNAi, making a biological role for rRFs in association with Argonautes unlikely. Background signals enabled identification of cryptococcal genes for RNase P, U1 snRNA, 37 H/ACA and 63 Box C/D snoRNAs, including U3 and U14 essential for pre-rRNA processing. To gain meaning, high-throughput RNA-Seq analyses need to incorporate negative data.

10

Toward a Disease Module for ME/CFS: A Network-Based Gene Prioritization

Maccallini, P.

2025-04-14 genetic and genomic medicine 10.1101/2025.04.13.25325733 medRxiv

Top 0.1%

34.1%

Background: Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) is a debilitating condition with unclear etiology and no FDA-approved treatment. Recent studies suggest a possible genetic contribution to its pathogenesis. Objective: This study aims to identify candidate genes for ME/CFS using both empirical evidence from genome-wide and next-generation sequencing studies on monogenic cases and computational expansion based on protein-protein interaction networks. Methods: Twenty-two genes associated with ME/CFS were identified from relevant literature, including both common and rare variants. These genes were used as seeds in the STRING database to retrieve high-confidence interacting genes. A Random Walk with Restart (RWR) algorithm ranked 1063 candidate genes by their similarity to the seeds. The top 250 ranking genes were selected to define a disease module termed the ME/CFS module. This module was analysed for enrichment in metabolic pathways and disease associations. Results: Enrichment analysis identified significant overlaps with sphingolipid metabolism and signaling, and energy-related pathways. Heme degradation, TP53-regulated metabolic genes, and thermogenesis were also identified as possibly contributing to the pathogenesis of ME/CFS. Overlaps with metabolic and neurodegenerative diseases were observed. Conclusion: The ME/CFS module captures biologically plausible mechanisms underlying ME/CFS, with a particular focus on lipid and energy metabolism. It also provides a tool for filtering exome and genome data for the study of Mendelian cases of ME/CFS.
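
For readers unfamiliar with Random Walk with Restart, the ranking step mentioned in this abstract can be sketched as a simple power iteration. This is a generic textbook formulation, not the author's code: the restart probability of 0.3, the unweighted edges, the function name, and the four-node toy network are all assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score nodes of an undirected graph by proximity to a set of seed nodes.

    adjacency maps each node to a list of its neighbors. At every step the
    walker jumps back to a uniformly chosen seed with probability `restart`,
    otherwise it moves to a uniformly chosen neighbor.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                continue  # isolated nodes pass no probability on
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical toy network: a chain A - B - C - D with A as the only seed.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy chain the stationary scores decrease with distance from the seed. On a real interaction network, selecting the top-ranked genes from such a score vector would correspond to the module definition step described above.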

11

The rise of sparser single-cell RNAseq datasets; consequences and opportunities

Bouland, G. A.; Mahfouz, A.; Reinders, M. J. T.

2022-05-21 bioinformatics 10.1101/2022.05.20.492823 medRxiv

Top 0.1%

33.8%

There is an exponential increase in the number of cells measured in single-cell RNA sequencing (scRNA-seq) datasets. Concurrently, scRNA-seq datasets become increasingly sparse as more zero counts are measured for many genes. We argue that with increasing sparsity the binarized representation of gene expression becomes as informative as count-based expression. We show that downstream analyses based on binarized gene expression give similar results to analyses based on count-based expression. Moreover, a binarized representation scales to 17-fold more cells that can be analyzed using the same amount of computational resources. Based on these observations, we recommend the development of specialized tools with bit-aware implementations for downstream analysis tasks, creating opportunities to get a more fine-grained resolution of biological heterogeneity.
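
To make the notion of a "bit-aware" representation concrete, the sketch below packs the detected/not-detected status of one gene across cells into a single Python integer used as a bitset. It is a minimal illustration under assumed names and toy counts, not code from the preprint, and it says nothing about the 17-fold figure reported there.

```python
def binarize(counts):
    """Pack per-cell counts for one gene into an int bitset (bit i set if cell i has count > 0)."""
    bits = 0
    for i, count in enumerate(counts):
        if count > 0:
            bits |= 1 << i
    return bits

def detection_rate(bits, n_cells):
    """Fraction of cells in which the gene was detected."""
    return bin(bits).count("1") / n_cells

def co_detection_jaccard(bits_a, bits_b):
    """Jaccard index of the sets of cells in which two genes are detected."""
    union = bin(bits_a | bits_b).count("1")
    if union == 0:
        return 0.0
    return bin(bits_a & bits_b).count("1") / union

# Hypothetical counts for two genes across eight cells.
gene_x = binarize([0, 3, 0, 1, 0, 0, 7, 0])
gene_y = binarize([0, 1, 0, 0, 0, 2, 4, 0])
similarity = co_detection_jaccard(gene_x, gene_y)
```

Each cell costs one bit per gene here, and set operations between genes reduce to bitwise AND/OR, which is the kind of saving a bit-aware tool could exploit.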

12

Understanding and evaluating ambiguity in single-cell and single-nucleus RNA-sequencing

He, D.; Soneson, C.; Patro, R.

2023-01-04 bioinformatics 10.1101/2023.01.04.522742 medRxiv

Top 0.1%

32.8%

Recently, a new modification has been proposed by Hjorleifsson and Sullivan et al. to the model used to classify the splicing status of reads (as spliced (mature), unspliced (nascent), or ambiguous) in single-cell and single-nucleus RNA-seq data. Here, we evaluate both the theoretical basis and practical implementation of the proposed method. The proposed method is highly conservative and therefore unlikely to mischaracterize reads as spliced (mature) or unspliced (nascent) when they are not. However, we find that it leaves a large fraction of reads classified as ambiguous, and, in practice, allocates these ambiguous reads in an all-or-nothing manner, and differently between single-cell and single-nucleus RNA-seq data. Further, as implemented in practice, the ambiguous classification is implicit and based on the index against which the reads are mapped, which leads to several drawbacks compared to methods that consider both spliced (mature) and unspliced (nascent) mapping targets simultaneously -- for example, the ability to use confidently assigned reads to rescue ambiguous reads based on shared UMIs and gene targets. Nonetheless, we show that these conservative assignment rules can be obtained directly in existing approaches simply by altering the set of targets that are indexed. To this end, we introduce the spliceu reference and show that its use with alevin-fry recapitulates the more conservative proposed classification. We also observe that, on experimental data, and under the proposed allocation rules for ambiguous UMIs, the difference between the proposed classification scheme and existing conventions appears much smaller than previously reported. We demonstrate the use of the new piscem index for mapping simultaneously against spliced (mature) and unspliced (nascent) targets, allowing classification against the full nascent and mature transcriptome in human or mouse in <3GB of memory. Finally, we discuss the potential of incorporating probabilistic evidence into the inference of splicing status, and suggest that it may provide benefits beyond what can be obtained from discrete classification of UMIs as splicing-ambiguous.

13

funMotifs: Tissue-specific transcription factor motifs

Umer, H. M.; Smolinska-Garbulowska, K.; Marzouka, N.-a.-d.; Khaliq, Z.; Wadelius, C.; Komorowski, J.

2019-06-27 genomics 10.1101/683722 medRxiv

Top 0.1%

31.2%

Transcription factors (TF) regulate gene expression by binding to specific sequences known as motifs. A bottleneck in our knowledge of gene regulation is the lack of functional characterization of TF motifs, which is mainly due to the large number of predicted TF motifs and the tissue specificity of TF binding. We built a framework to identify tissue-specific functional motifs (funMotifs) across the genome based on thousands of annotation tracks obtained from large-scale genomics projects including ENCODE, RoadMap Epigenomics and FANTOM. The annotations were weighted using a logistic regression model trained on regulatory elements obtained from massively parallel reporter assays. Overall, genome-wide predicted motifs of 519 TFs were characterized across fifteen tissue types. funMotifs summarizes the weighted annotations into a functional activity score for each of the predicted motifs. funMotifs enabled us to measure tissue specificity of different TFs and to identify candidate functional variants in TF motifs from the 1000 Genomes Project, the GTEx project, the GWAS catalogue, and in 2,515 cancer samples from the Pan-cancer analysis of whole genome sequences (PCAWG) cohort. To enable researchers to annotate genomic variants or regions of interest, we have implemented a command-line pipeline and a web-based interface that is publicly accessible at http://bioinf.icm.uu.se/funmotifs.

14

Characterizing Highly Conserved Fragments in 3'UTRs via Computational and Transfer Learning Approaches

Ho, E. S.; Baeck-Hubloux, A.; Dinh, N.; Severino, A.; Troy, C.

2026-01-20 genomics 10.64898/2026.01.19.700376 medRxiv

Top 0.1%

31.1%

3' untranslated regions (3'UTRs) serve as regulatory platforms that modulate translation, mRNA localization, and stability through the binding of regulators, such as RNA-binding proteins (RBPs) and miRNAs, in a sequence-specific manner. These vital binding sites are often identified through orthologous regions among species. A separate but related discovery is the ultraconserved elements (UCEs) detected in human, rat, and mouse genomes two decades ago. However, our knowledge about their functions is limited. Perplexingly, alterations in UCEs in mouse embryos can still produce viable progeny with no observable phenotypic differences. The majority of UCEs are non-coding, though ~8% are located in the 3'UTRs. Given the importance of 3'UTRs in gene regulation, we use a computational approach to identify highly conserved fragments (CFs) in 3'UTRs across diverse mammals, applying criteria appropriate for 3'UTRs (≥50 bp and ≥90% identity). Results show that they are not composed of simple repeats or low-complexity regions common to mammalian genomes. Using a transformer-based foundational genomic model, CFs are characterized as A- and T-rich and distinguishable from the 3'UTR background. 36 human CFs from 25 genes are significantly depleted in variations in humans. They are enriched in neuronal tissues and play roles in neurodevelopment and RNA processing, mediated by RBPs and miRNAs. Our findings expand on existing studies that attribute UCEs primarily to enhancer function, suggesting a new path to explore the biological roles of UCEs in 3'UTRs.

15

Programmed ribosomal frameshifts, and how to find them

McNair, K.; Salamon, P.; Edwards, R. A.; Segall, A. M.

2023-04-12 bioinformatics 10.1101/2023.04.10.536325 medRxiv

Top 0.1%

30.9%

One of the stranger phenomena that can occur during gene translation arises when, as a ribosome reads along the mRNA, various cellular and molecular properties contribute to stalling the ribosome on a slippery sequence, shifting it into one of the two alternate reading frames. The alternate frame has different codons, so different amino acids are added to the peptide chain, but more importantly, the original stop codon is no longer in-frame, so the ribosome can bypass the stop codon and continue to translate the codons past it. This produces a longer version of the protein, a fusion of the original in-frame amino acids followed by all the alternate-frame amino acids. There is currently no automated software to predict the occurrence of these programmed ribosomal frameshifts (PRF), and they are currently only identified by manual curation. Here we present the first machine-learning-based method to detect and predict the presence of PRFs in all types of coding genes and taxa with an accuracy exceeding 90%.
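
The mechanism described in this abstract, where a shift of one nucleotide takes the original stop codon out of frame and yields a longer fusion protein, can be shown on a toy sequence. The sketch below only illustrates the -1 frameshift itself and is unrelated to the authors' machine-learning predictor; the sequence, the slip position, and the function names are invented for the example, and the sequence is written in the DNA alphabet (T instead of U).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(seq, start=0):
    """Translate from `start` until a stop codon or the end of the sequence."""
    peptide = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

def translate_with_frameshift(seq, slip_at, shift=-1):
    """Translate in frame 0 up to index `slip_at`, then continue in the shifted frame."""
    upstream = translate(seq[:slip_at])
    downstream = translate(seq, start=slip_at + shift)
    return upstream + downstream

# Hypothetical transcript: frame 0 reads ATG GCT TTT AAA TAA (stop) ...
toy_seq = "ATGGCTTTTAAATAAGGCCGTTGA"
normal_protein = translate(toy_seq)
shifted_protein = translate_with_frameshift(toy_seq, slip_at=12, shift=-1)
```

In frame 0 the toy sequence stops after four residues, while slipping back one nucleotide after the fourth codon reads through the former stop and appends alternate-frame residues, giving the longer fusion product the abstract describes.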

16

GOALS: Gene Ontology Analysis with Layered Shells for Enhanced Functional Insight and Visualization

Yue, Z.; Welner, R. S.; Willey, C. D.; Amin, R.; Chen, J. Y.

2025-04-23 bioinformatics 10.1101/2025.04.22.650095 medRxiv

Top 0.1%

30.9%

Gene Ontologies (GOs) are standardized descriptions of gene functions in terms of biological processes, molecular functions, and cellular components, capturing their Parent-Child relationships in a structured framework and advancing cancer biological modeling to provide consistent and meaningful insights into functional genomics analysis. The conventional GO hierarchical structure is defined by human curation experts, with levels determined by the shortest path to the root term. However, grouping GOs poses challenges due to the uneven distribution of gene members within GO terms and inconsistencies in the level of detail across terms at the same GO level. In this work, we introduce Gene Ontology Analysis using Layered Shells (GOALS), a novel tool that discretizes GOAs into optimal layers. GOALS creates scalable GO layers while maintaining a balanced number of genes across GOs in each layer. Unlike existing tools, the GOALS framework organizes GO terms using a bottom-up approach based on their co-membership network, discretizing GOs to achieve an exponential fit with the GOs' gene member size. Meanwhile, GOALS reveals clusters or supersets reflecting biological relevance by unsupervised clustering of the GOs' latent projections. In a case study on mouse natural killer (NK) cell development, GOALS identified distinct GO functional clusters with multiple GO layers, revealing multiple levels of detail from specific to abstract contexts to maximize signal discovery and uncover those signals' associations with trajectory divergence. More importantly, GOALS enhances enrichment analysis by introducing additional GO stratification and a latent GO map that enable more accurate classification of functional differences. GOALS offers a robust and innovative framework for exploring disordered GO clusters, mining GO activities, and analyzing potential GO-GO interplays. By addressing critical challenges in functional genomics, GOALS provides a powerful tool for advancing our understanding of cell heterogeneity and potentially uncovering actionable insights for therapeutic development.

17

DENetwork: Unveiling Regulatory and Signaling Networks Behind Differentially-Expressed Genes

Su, T.-Y.; Islam, Q. S.; Huang, S. K.; Baglole, C. J.; Ding, J.

2023-06-27 bioinformatics 10.1101/2023.06.27.546719 medRxiv

Top 0.1%

30.8%

Differential gene expression analysis from RNA-sequencing (RNA-seq) data offers crucial insights into biological differences between sample groups. However, the conventional focus on differentially-expressed (DE) genes often omits non-DE regulators, which are an integral part of such differences. Moreover, DE genes frequently serve as passive indicators of transcriptomic variations rather than active influencers, limiting their utility as intervention targets. To address these shortcomings, we have developed DENetwork. This innovative approach deciphers the intricate regulatory and signaling networks driving transcriptomic variations between conditions with distinct phenotypes. Unique in its integration of both DE and critical non-DE genes in a graphical model, DENetwork enhances the capabilities of traditional differential gene analysis tools, such as DESeq2. Our application of DENetwork to an array of simulated and real datasets showcases its potential to encapsulate biological differences, as demonstrated by the relevance and statistical significance of enriched gene functional terms. DENetwork offers a robust platform for systematically characterizing the biological mechanisms that underpin phenotypic differences, thereby augmenting our understanding of biological variations and facilitating the formulation of effective intervention strategies.

18

RIBO-former: leveraging ribosome profiling information to improve the detection of translated open reading frames.

Clauwaert, J.; McVey, Z.; Gupta, R.; Menschaert, G.

2023-06-24 bioinformatics 10.1101/2023.06.20.545724 medRxiv

Top 0.1%

30.7%

Ribosome profiling is a deep sequencing technique used to chart translation by means of mRNA ribosome occupancy. It has been instrumental in the detection of non-canonical coding sequences. Because of the complex nature of next-generation sequencing data, existing solutions that seek to identify translated open reading frames from the data are still not perfect. We propose RIBO-former, a new approach featuring several innovations for the de novo annotation of translated coding sequences. RIBO-former is built using recent transformer models that have achieved considerable advancements in the field of natural language processing. The presented deep learning approach allows several pre-processing steps to be omitted, as features are automatically extracted from the data. We discuss various steps that improve the detection of coding sequences and show that read length information of all mapped reads can be leveraged to improve the predictive performance of the tool. Our results show that RIBO-former outperforms previous methodologies. Additionally, through our study we find support for the existence of translated non-canonical ORFs, present along existing coding sequences or on long non-coding RNAs. Furthermore, several polycistronic mRNAs with multiple translated coding regions were detected.

19

Beyond protein functions: evaluating completeness, coherence, and consistency of genome-scale function annotations

Tawfiq, R.; Kulmanov, M.; Hoehndorf, R.

2025-07-18 bioinformatics 10.1101/2025.07.14.664848 medRxiv

Top 0.1%

30.4%

Protein function annotation has traditionally followed a reductionist approach, assigning functions to individual proteins acting in isolation. This paradigm treats each annotation as an independent fact, disconnected from the broader biological system. However, proteins operate within integrated cellular networks where their functions depend on genomic context and the presence of interacting partners. Here, we develop a genome-scale evaluation framework that assesses whether annotated protein functions could plausibly coexist within a living organism. We formalize three criteria grounded in systems biology principles: completeness (presence of essential functions), coherence (satisfaction of functional dependencies), and consistency (absence of mutually exclusive functions). Applying this framework to bacterial genomes, we evaluated manually curated annotations from six model organisms and computational predictions from six methods. While model organism annotations largely satisfied our constraints (with violations primarily reflecting host-pathogen interactions), all computational prediction methods systematically failed to produce biologically plausible genome-scale annotations. Methods achieved high accuracy for individual proteins yet produced incomplete metabolic pathways, incoherent protein complexes, and taxonomically impossible function combinations. These results reveal a fundamental disconnect between the reductionist annotation model and the systems-level requirements of biological organisms. Current computational methods amplify this disconnect as they are optimized for protein-level accuracy while ignoring genome-scale constraints. Our framework provides quantitative metrics for evaluating biological plausibility and establishes a foundation for developing system-aware annotation approaches. The shift from reductionist to systems-level perspectives will be essential for annotating the rapidly growing collection of sequenced genomes and metagenomes.

Significance Statement: Protein function prediction methods are evaluated by their accuracy on individual proteins, but proteins operate within integrated biological systems with strict functional requirements. We developed a framework that evaluates whether predicted protein functions could plausibly coexist in a living organism by checking for completeness of essential functions, coherence of functional dependencies, and consistency with biological constraints. While manually curated annotations largely satisfy these requirements, all computational prediction methods systematically fail to produce biologically viable genome-scale annotations. This reveals a fundamental disconnect between current evaluation paradigms and the systems-level requirements of biology, highlighting the need for prediction methods that consider genome-scale constraints rather than optimizing for individual protein accuracy.
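
As a rough illustration of the three criteria named in this abstract, the sketch below checks a genome's set of function labels against toy constraints. It is not the authors' framework: the `Constraints` class, the `evaluate` function, and every function label are hypothetical names chosen for the example, and real constraints would presumably be derived from an ontology or curated knowledge base.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    """Toy genome-scale constraints (all hypothetical, for illustration only)."""
    essential: set[str]                # functions every genome is assumed to need
    dependencies: dict[str, set[str]]  # function -> functions it requires
    exclusive: list[tuple[str, str]]   # pairs assumed unable to co-occur


def evaluate(annotations: set[str], c: Constraints) -> dict[str, list]:
    """Return the violations of each criterion for one annotated genome."""
    # Completeness: essential functions absent from the annotation set.
    missing = sorted(c.essential - annotations)
    # Coherence: annotated functions whose required partners are absent.
    unmet = sorted(
        (func, dep)
        for func, deps in c.dependencies.items()
        if func in annotations
        for dep in deps - annotations
    )
    # Consistency: mutually exclusive pairs that are both annotated.
    conflicts = sorted(
        (a, b) for a, b in c.exclusive if a in annotations and b in annotations
    )
    return {"completeness": missing, "coherence": unmet, "consistency": conflicts}


constraints = Constraints(
    essential={"dna_replication", "translation"},
    dependencies={"complex_subunit_a": {"complex_subunit_b"}},
    exclusive=[("trait_x", "trait_y")],
)
genome = {"dna_replication", "complex_subunit_a", "trait_x", "trait_y"}
report = evaluate(genome, constraints)
```

An empty list under each key would mean the annotation set passes that criterion; here the toy genome fails all three, which mirrors the kind of genome-level failure the abstract reports for predictions that may still be accurate protein by protein.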

20

Coverage landscape of the human genome in nucleus DNA and cell-free DNA

Luo, J.; Li, S.

2025-02-07 genomics 10.1101/2024.12.03.626615 medRxiv

Top 0.1%

30.2%

Genome-wide coverage has long been used as a measure of sequencing quality and quantity, but the underlying biology has not been fully exploited. Here we performed a comparative analysis of genome-wide coverage profiles between nuclear genomic DNA (gDNA) samples from the 1000 Genomes Project (n=3,202) and cell-free DNA (cfDNA) samples from healthy controls (n=113) or cancer patients (n=362). Regardless of sample type, we observed an overall conserved landscape with segmentation of coverage, where adjacent windows of genome positions present similar coverage. Besides GC content, we identified protein-coding gene density and nucleosome density as major factors influencing the coverage of gDNA and cfDNA, respectively. Differential coverage of cfDNA vs gDNA was found in immune-receptor loci, intergenic regions and non-coding genes, reflecting distinct genome activities in different cell types. A further rise in coverage at non-coding genes and intergenic regions, plus a further drop in coverage at protein-coding genes and genic regions, within cancer cfDNA samples indicated a loss of contribution by normal cells. Importantly, we observed the distinctive feature of coverage convergence in cancer-derived cfDNA, with the extent of convergence positively correlated with cancer stage. Based on these findings, we developed and validated an outlier-detection approach for cfDNA-based cancer screening without the need for cancer samples for training, outperforming current benchmarks on condition-matched and condition-unmatched cancer detection tasks.